Ever since humans learned to write, mistakes made in writing words have held a privileged place in
linguistic studies and have given rise to school subjects such as spelling and dictation. According to
our extensive survey of work on spell-checking errors made in typing Arabic texts, very few studies deal
with typographical errors specifically caused by the insertion or omission of the blank-space in words.
Spelling correction software, for its part, remains ineffective at handling this type of error. Failure
to process errors due to the insertion or omission of a blank-space between or inside words leads to
ambiguity and incomprehension of the meaning of the typed text. To remedy this limitation, we propose in
this article an ad-hoc probabilistic method based jointly on two approaches. The first approach treats
errors due to the deletion or omission of a blank-space between or inside words, while the second focuses
on correcting space insertion errors in a word, in addition to other kinds of elementary editing errors
(addition, deletion, permutation of characters). Our approach combines the edit distance with n-gram
language models to correct these errors. It reaches an accuracy rate of 98.14% for missing blank-space
errors (noted MBSE) and 89.5% for insertion blank-space errors (noted IBSE), which gives an average
correction rate of around 95.26%. These results are very encouraging and show the interest and the
importance of our approach.

Keywords: Correction, Insertion, Measure, Missing, Probability, Spelling
This is an open access article under the CC BY-SA license. 
1. INTRODUCTION

Since the beginning of the computer age, information and communication technologies have continued to
advance. This progress has been accompanied by an enormous amount of information that is generated daily
and stored on electronic media (electronic journals, emails, blogs, briefs, tutorials, speeches), and a
new style of reading and writing has emerged with it. In practice, spelling errors deserve increasing
attention in typed text, because we often type quickly without effective control over what has been
written. Thanks to spell checkers, which are now omnipresent in word processors, email clients,
information retrieval engines and social media applications, the writer can review the typed text and
correct the marked spelling errors (typos), which helps resolve ambiguity in the text. In natural
language processing (NLP), spelling correction is one of the oldest and most important research topics:
the first research works date back to the 1960s [1], and, according to Mitton [2], the first automatic
detection systems appeared at the beginning of the 1960s. Finding solutions to the problem of
spell-checking text has long been a challenge. Several researchers have investigated it and, through
their efforts, various techniques and many algorithms have emerged. Error detection involves finding
incorrectly spelled words in a given text, while the correction phase consists in proposing candidate
solutions that are close to the erroneous word [3].

The first research in the area of spelling correction attempted to model the notion of spelling error.
The founding article for this modeling was proposed by Damerau [4]. Based on a statistical study of
spelling errors taken out of context, Damerau considered that an error is a single or multiple
combination of elementary editing operations: the insertion, deletion, permutation and transposition of
characters in a lexical word. Since then, and based on this modeling, several approaches and algorithms
have been proposed for the correction of spelling errors. The metric method initiated by Levenshtein
remains the most suitable [5]. It compares two strings by counting the elementary editing operations
(insertion, deletion and permutation) needed to transform one into the other. This metric, known as the
edit distance, is applied in practically all spell-checking functions and is also used extensively in
bioinformatics. Subsequently, a succession of approaches was proposed to enrich the correction methods,
which we can classify into: i) correction approaches based on lexical similarity measurement: this
category includes, for example, the edit distance, the Jaccard distance [6], the Jaro distance [7], [8]
and the distance of Stoilos et al. [9]; ii) probabilistic correction approaches, such as those based on
probabilistic finite state automata (Oflazer [10]), the alpha-code method (Pollock et al. [11]), the
Dice index method [12], correction by n-gram decomposition, or search-based approaches in lexicographic
trees [13]; iii) hybrid correction approaches, which join metric-based spelling correction with methods
that use probabilistic language models and learning corpora. It is in this sense that we have proposed a
series of works on spellchecking dedicated to the Arabic language.
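
As an illustration of the edit distance described above, the following is a minimal Python sketch of the
classical dynamic-programming computation. It assumes a unit cost for each operation and takes
"permutation" to mean replacing one character by another, which is an assumption about the terminology
used here. The function name `edit_distance` is chosen for this example and does not come from the cited
works.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: minimum number of single-character insertions,
    deletions and substitutions needed to turn source into target."""
    # previous[j] is the distance between the already processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to the empty string
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(edit_distance("interp", "inter"))
print(edit_distance("reted", "rated"))
```

Only two rows of the dynamic-programming table are kept in memory, which is sufficient when only the
distance, and not the sequence of operations, is needed.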

Our objectives in this line of research were to improve the ranking of the proposed corrections and the
precision rate of the edit-distance-based correction process [14]-[16], to integrate morphological
analysis into the spelling correction phase [17], and to take the context into consideration in the
correction process [18]. For a decade, our research team has presented a series of approaches in the
field of spelling correction for errors made in typing Arabic texts [19], [20]. In our recently
published study, we observed that spellchecking systems remain ineffective in correcting some types of
typing errors, specifically errors arising from improperly inserting and/or deleting the keyboard space
character in a lexical word. Several studies have shown that space errors account for a considerable
proportion of all errors. According to a statistical study of spelling errors made in typing Urdu texts,
Naseem and Hussain [21] found that 23% of these errors are of the blank-space insertion/deletion type,
while Pedler [22] identified that 8% of the errors in an English language learning corpus are space
errors. Moreover, these spellcheckers only provide solutions separately for each segment marked as a
non-dictionary word. Failure to process this type of error can lead to ambiguity and incomprehension of
the meaning of the typed text. This category of errors can arise in different situations. For example,
text is often typed quickly without control, so the writer can remove the space between two successive
words or insert a space inside a word. Optical character recognition (OCR) can also delete spaces
between words. We also observed spaces being inserted or removed when converting PDF documents to
WinWord documents.

Error classification: first, we define an error as any lexical form that does not correspond to any form
stored in the dictionary or generated from it. Based on several statistical studies of learning corpora,
several classifications of spelling errors have been proposed. The following classes may be mentioned in
particular: i) typing errors: mainly due to doubling characters, inverting characters, omitting
characters or hitting a neighboring key; ii) editing errors: due to copy/paste operations, repetition of
a word and/or a sentence, absence of a word, or even breaking of a word; iii) usual spelling errors
(dyslexia): usually due to the lack of correspondence between the oral and written forms, and committed
on less frequent words of the dictionary; iv) grammatical errors: lexically correct words that do not
respect the grammatical rules of a given language (rules of agreement and grammar); v) segmentation
errors: the type of error that interests us in this study. They come from the omission of a space
(fusion) or the insertion of a space (segmentation) [22].

As pointed out above, errors due to the space character are caused by either deleting or inserting a
space: i) missing blank-space errors: in this case, the writer omits the space between two successive
words, which results in a merger of the two words that is treated as an entry not belonging to the
lexicon. Converting documents from one software format to another, such as from PDF to Word and vice
versa, often leads to blank-space deletions throughout the converted document. Example:
"naturallanguage" instead of "natural language"; ii) blank-space insertion errors: when typing quickly
without checking what has been typed, the writer can inadvertently insert one or more spaces inside a
word, which segments the word into several sequences that can be either lexical entries or erroneous
sequences. We have observed that space insertion errors also come from converting a PDF file to a Word
document and vice versa, or from opening a WinWord document with an earlier version of the software. For
example, instead of obtaining the word "technology" after conversion, a white space may be inserted,
giving the two sequences "techn" and "ology", which are incorrect lexical entries.

In reality, blank-space insertion errors disrupt the syntax and semantics of the whole sentence,
especially when the generated segments are themselves lexical words, which is the case in Arabic, for
example. To correct these errors, the writer's intervention is required to detect them first and then
correct them. For example, the meaning of the sentence "Vaccination against COVID-19" is not the same
after spaces are inserted in the word vaccination: "Vac ci nation against COVID-19". According to our
survey of spelling errors due to the insertion or omission of a blank-space, few studies have addressed
this type of error. Among them are works that generate all possible partitions of the erroneous word and
then check whether each part exists in the dictionary [23], [24]. Alkanhal et al. [25] presented a
method that merges the erroneous word with its different neighboring words. This procedure produces a
list of the possible merges, from which the correct merges are then selected.
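
The partition-based strategy mentioned above can be sketched as follows. This is a simplified
illustration rather than the method of [23], [24]: it only enumerates two-way splits, and the function
name `split_candidates` and the toy lexicon are assumptions made for this example.

```python
def split_candidates(token: str, lexicon: set[str]) -> list[tuple[str, str]]:
    """Return every two-way split of token whose two parts are both lexicon entries."""
    return [
        (token[:cut], token[cut:])
        for cut in range(1, len(token))  # cut positions between characters
        if token[:cut] in lexicon and token[cut:] in lexicon
    ]


# Toy lexicon, used only for illustration.
toy_lexicon = {"natural", "language", "nature"}
print(split_candidates("naturallanguage", toy_lexicon))
```

A split is only found when both parts are spelled correctly, which is why such a strategy cannot by
itself handle a missing space combined with another editing error.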


2. METHOD 

In this article, we propose a probabilistic metric method based on the edit distance and n-gram language
models to better correct errors caused by the blank-space (insertion and deletion), in combination with
elementary editing errors (addition, deletion and permutation). The method combines the approach for
correcting missing blank-space errors [26] with another approach that corrects blank-space insertion
errors. In the following sections, we present these two approaches in detail.


2.1. Correction approach for missing blank-space errors 

We recall here that our method of correcting missing blank-space errors is based on an adaptation of
the edit distance. The principle is to detect where the space has been omitted and then proceed to the
actual correction (Yousfi et al. [26]). The position (noted $pos.space_i$) where the space is inserted in
the erroneous word $w_{err}$ is modeled by rule (1).

$$pos.space_i = \underset{1 \le j \le n}{\arg\min}\; D_{ed}(w_{err}^{j}, w_i) \tag{1}$$

Where:
- $w_i$ is a word of the vocabulary $V$ and $D_{ed}$ is the edit distance.
- $w_{err}^{j} = e_1 e_2 \ldots e_j$ is the part of $w_{err}$ from the first character to the $j$-th character.
- $e_i$ is the $i$-th character of $w_{err}$.
- $n$ is the length of the word $w_{err}$.

After detecting where the space has been omitted, we obtain two fragments:
- $W1_{pos.space_i} = e_1 e_2 \ldots e_{pos.space_i}$ and $W2_{pos.space_i} = e_{pos.space_i+1} \ldots e_n$
- $W1_{pos.space_i} + W2_{pos.space_i} = w_{err}$
where $e_{pos.space_i+1}$ is the character of the word $w_{err}$ at position $pos.space_i + 1$.
In the second phase, we check whether the two fragments are in the lexicon. If not, we must correct the
two resulting segments $W1_{pos.space_i}$ and $W2_{pos.space_i}$, which may also contain other basic
editing errors. To correct missing blank-space errors and other kinds of editing errors at the same
time, we defined a new metric, noted $D_{Misp}$, based on the edit distance (2).

$$D_{Misp}(w_{err}, w_i) = \min\Big[ D_{ed}(w_{err}, w_i),\; \min_{j} D_{ed}(w_j, W1_{pos.space_i}) + \min_{k} D_{ed}(w_k, W2_{pos.space_i}) \Big] \tag{2}$$

The best solutions are those that satisfy (3):

$$\min_{w_i \in V} D_{Misp}(w_{err}, w_i) \tag{3}$$

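
The two phases above can be sketched in Python as follows. This is a minimal illustration under stated
assumptions, not the authors' implementation: the vocabulary is a toy set, the split position is
restricted to 1..n-1 so that both fragments are non-empty, ties are broken alphabetically, and the names
`nearest` and `correct_missing_space` are hypothetical.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit costs."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def nearest(fragment, vocabulary):
    """Return (distance, word) for the vocabulary word closest to fragment."""
    return min((edit_distance(word, fragment), word) for word in vocabulary)


def correct_missing_space(w_err, vocabulary):
    """Return (distance, correction) minimising the measure of Eq. (2)."""
    candidates = []
    for w_i in vocabulary:
        # First branch of Eq. (2): w_err is one misspelled word.
        candidates.append((edit_distance(w_err, w_i), w_i))
        if len(w_err) < 2:
            continue  # too short to be split into two fragments
        # Eq. (1): split position whose prefix is closest to w_i.
        pos = min(range(1, len(w_err)),
                  key=lambda j: edit_distance(w_err[:j], w_i))
        d_left, left = nearest(w_err[:pos], vocabulary)
        d_right, right = nearest(w_err[pos:], vocabulary)
        # Second branch of Eq. (2): two words with the space restored.
        candidates.append((d_left + d_right, f"{left} {right}"))
    return min(candidates)  # Eq. (3): keep the smallest distance


# Toy vocabulary, used only for illustration.
vocabulary = {"natural", "language", "nature"}
print(correct_missing_space("naturallanguage", vocabulary))
print(correct_missing_space("naturllanguage", vocabulary))
```

The second call combines a missing space with a deleted character, which is the situation the combined
measure is meant to cover.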
2.2. Correction approach for insertion blank-space errors 

To correct the errors caused by blank-space insertion, we have developed a new method which relies on
the edit distance and n-gram language models [27]-[29]. This made it possible to define a new
probabilistic measure to correct this kind of error. The principle is to detect that the marked error is
due to the insertion of a blank-space in a word and then to proceed to the actual correction. For the
detection phase, we studied the following scenarios: i) if two successive words $w_1$ and $w_2$ are
erroneous, then the probability is very high that the keyboard space character has been inserted inside
a word. In this situation, we join the two sequences and correct $w_1 + w_2$ as a single entity. For
example, for a space inserted in "reda biliti", with $w_1$ = "reda" and $w_2$ = "biliti", we merge the
two sequences and correct the merge "redabiliti"; ii) if a word consists of only one character, then a
space has very likely been wrongly inserted, as in the word "Irresistible", which becomes
"I rresistible". However, this is not always the case: for example, segmenting the word "Iris" with the
space character gives two correct segments, "I" and "ris"; iii) finally, if a word is erroneous and the
next word is correct, the space character may still have been inserted in the word.

To handle these scenarios, we present two methods: i) process 1: a blank-space insertion error is only
handled when there are two consecutive erroneous words; ii) process 2: an erroneous word is handled
regardless of the next word. In addition to the ordinary corrections of this erroneous word, we add the
solutions obtained by assuming that a blank-space was wrongly inserted between the erroneous word and
each of its two neighboring words.


2.3. Correcting blank-space insertion errors according to process 1 

Let $T = w_1 w_2 \ldots w_p$ be a text composed of a sequence of Arabic words, and suppose that we have
two successive erroneous words $w_i$ and $w_{i+1}$ in this text. Our proposal consists in joining $w_i$
and $w_{i+1}$ ($w_i + w_{i+1}$) and checking whether this merge exists in the lexicon of our system. If
so, we keep this fusion as a potential solution; otherwise we use our measure $D_{wed}$, which is a kind
of weighting between the edit distance and the bi-gram language model. The distance $D_{wed}$ is defined
by (4) and (5):

$$D_{wed}(w_i, w_k) = \min\left[ \frac{D_{ed}(w_i, w_k)}{\Pr(w_k / w_{i-1})} + \frac{D_{ed}(w_{i+1}, w_k)}{\Pr(w_k / w_i)},\; \frac{D_{ed}(w_i + w_{i+1}, w_k)}{\Pr(w_k / w_{i-1})} \right] \tag{4}$$

The best solutions are those that satisfy:

$$\min_{w_k \in \text{Lexicon}} D_{wed}(w_i, w_k) \tag{5}$$
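
Process 1 can be sketched in Python as follows. This is an illustration under stated assumptions, not
the authors' implementation: the bi-gram probabilities are a toy table with a small floor value for
unseen pairs, the merged form is conditioned on the word that precedes the first erroneous token, and
the names `bigram_prob` and `process1` are hypothetical.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit costs."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def bigram_prob(word, previous_word, bigrams, floor=1e-6):
    """Pr(word / previous_word) from a toy table; unseen pairs get a small floor."""
    return bigrams.get((previous_word, word), floor)


def process1(prev_word, w_i, w_next, lexicon, bigrams):
    """Best lexicon word for two consecutive erroneous tokens w_i and w_next."""
    merged = w_i + w_next
    if merged in lexicon:
        return merged  # the fusion itself is a lexicon entry

    def d_wed(w_k):
        # Tokens corrected separately: the two weighted distances are added.
        separate = (edit_distance(w_i, w_k) / bigram_prob(w_k, prev_word, bigrams)
                    + edit_distance(w_next, w_k) / bigram_prob(w_k, w_i, bigrams))
        # Tokens merged back into one word before correction.
        fused = edit_distance(merged, w_k) / bigram_prob(w_k, prev_word, bigrams)
        return min(separate, fused)

    return min(sorted(lexicon), key=d_wed)


# Toy data, used only for illustration.
lexicon = {"interpreted", "inter", "rated", "an"}
bigrams = {("an", "interpreted"): 0.5}
print(process1("an", "interp", "reted", lexicon, bigrams))
print(process1("an", "interp", "retde", lexicon, bigrams))
```

Dividing by the bi-gram probability lowers the score of candidates that are likely after the preceding
word, so a frequent continuation can win over a candidate that is only slightly closer in edit distance.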


2.4. Correcting blank-space insertion errors according to process 2 
To correct spelling errors while also taking into account blank-space errors in the case covered by
process 2, we apply the $D_{wed}$ measure, which is defined this time as (6).

$$D_{wed}(w_i, w_k) = \min\left[ \frac{D_{ed}(w_i, w_k)}{\Pr(w_k / w_{i-1})},\; \frac{D_{ed}(w_{i-1} + w_i, w_k)}{\Pr(w_k / w_{i-2})},\; \frac{D_{ed}(w_i + w_{i+1}, w_k)}{\Pr(w_k / w_{i-1})} \right] \tag{6}$$

where $w_{i-1} + w_i$ denotes the concatenation of the two words $w_{i-1}$ and $w_i$.

The best corrections of the erroneous word $w_i$ are given by the following formula:

$$\underset{w_k \in \text{Lexicon}}{\arg\min}\; D_{wed}(w_i, w_k) = \underset{w_k \in \text{Lexicon}}{\arg\min}\; \min\left[ \frac{D_{ed}(w_i, w_k)}{\Pr(w_k / w_{i-1})},\; \frac{D_{ed}(w_{i-1} + w_i, w_k)}{\Pr(w_k / w_{i-2})},\; \frac{D_{ed}(w_i + w_{i+1}, w_k)}{\Pr(w_k / w_{i-1})} \right] \tag{7}$$
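
Process 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not
the authors' implementation: the bi-gram table and lexicon are toy data, unseen bi-grams receive a small
floor probability, positions outside the sentence are padded with a hypothetical "<s>" marker, and the
name `process2` is chosen for this example.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit costs."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def process2(tokens, i, lexicon, bigrams, floor=1e-6):
    """Return (score, correction, corrected form) for the erroneous tokens[i]."""
    def prob(word, previous_word):
        return bigrams.get((previous_word, word), floor)

    def at(j):
        return tokens[j] if 0 <= j < len(tokens) else "<s>"

    # Each hypothesis pairs a form to correct with the word that precedes it.
    hypotheses = [(at(i), at(i - 1))]                      # w_i alone
    if i >= 1:
        hypotheses.append((at(i - 1) + at(i), at(i - 2)))  # w_{i-1} + w_i
    if i + 1 < len(tokens):
        hypotheses.append((at(i) + at(i + 1), at(i - 1)))  # w_i + w_{i+1}

    scored = [
        (edit_distance(form, w_k) / prob(w_k, previous_word), w_k, form)
        for form, previous_word in hypotheses
        for w_k in sorted(lexicon)
    ]
    return min(scored)  # smallest weighted distance over all hypotheses


# Toy data, used only for illustration.
tokens = "Python is an interp reted language programming".split()
lexicon = {"inter", "inters", "enter", "instep", "intrepid", "interpreted"}
bigrams = {("an", "interpreted"): 0.2, ("an", "inter"): 0.01}
print(process2(tokens, 3, lexicon, bigrams))
```

Returning the corrected form alongside the candidate shows which hypothesis produced the best score, and
therefore whether a neighboring token has to be absorbed into the correction.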


Example: consider the following sentence, which contains a fragmented word: "Python is an interp reted
language programming". We have two successive erroneous words after the insertion of a blank-space in
the word "interpreted", so the processing is carried out by both methods.

i) Process 1:
a) The spelling corrections of the two erroneous sequences "interp" and "reted" are:
- "interp" = {inter, inters, enter, instep, intrepid} and
- "reted" = {rated, rented, rested, reed, retied}. The minimum distance for each is 1, so the sum is
1 + 1 = 2.
b) The solution "interpreted": after deleting the space between the sequences "interp" and "reted", we
obtain this word, which is a lexicon entry (distance 0).
c) Using the distance $D_{wed}$, the minimum of 0 and 2 is 0, so the best solution is "interpreted"
("Python is an interpreted language programming").

ii) Process 2:
a) The first erroneous sequence is "interp", which gives the following three sets (without taking the
n-gram language models into account in the calculation):
b) Corrections of the erroneous word "interp" = {inter, inters, enter, instep, intrepid}.
c) Corrections of the merged word "aninterp" = {empty}, that is, no suggestion.
d) The merged word "interpreted", which is a lexicon entry.

The last set contains the best solution for the blank-space insertion error.


3. RESULTS AND DISCUSSION

To test the effectiveness of our ad-hoc method for correcting errors due to the insertion or omission
of a blank-space in a word, while also taking into account the other types of elementary editing errors
(addition, deletion and permutation), we built a corpus of 1000 paragraphs extracted from Wikipedia. For
space deletion errors in combination with editing errors, we built a corpus of 3000 errors. For
insertion errors combined with other types of editing errors, we randomly introduced errors in words
inside the paragraphs of our corpus, giving 1500 insertion errors. For deletion errors combined with
editing errors, we corrected the errors using the approach described in this article, and we also
undertook the demanding task of comparing our approach with the spellchecker integrated into the WinWord
editor.

This comparison took a lot of time, but the results were striking: the correction rate reached 98.14%
for our method against only 42.53% for WinWord. For space insertion errors in combination with editing
errors, we corrected the errors using the approach described in section 2.2, again comparing with the
spellchecker built into WinWord. The results showed that our probabilistic metric outperforms the
WinWord spellchecker, with a correction rate reaching 89.5% against only 19.12% for WinWord. We found
that spellcheckers, not only the one built into WinWord, only offer solutions for the segments marked
as wrong after a space is inserted in a word, and ignore the rest of the words in the sentence.
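
The average correction rate of about 95.26% reported in the abstract is consistent with weighting the
two rates by the sizes of the two test sets (3000 missing-space errors and 1500 insertion errors).
Assuming that this weighting is how the average was obtained, the arithmetic is:

```latex
\[
\frac{3000 \times 98.14 + 1500 \times 89.5}{3000 + 1500}
  = \frac{294420 + 134250}{4500}
  = \frac{428670}{4500}
  = 95.26\%
\]
```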


4. CONCLUSION 
In this contribution, we presented the details of our spelling correction system, which is based on an
ad-hoc method for correcting errors due to the insertion or omission of the blank-space character in
words while taking other types of basic typing errors into consideration. This ad-hoc method relies on a
probabilistic metric that combines the edit distance with probabilistic language models. It is very
effective in removing the ambiguity and misunderstanding of the text that space errors can cause. The
experimental results obtained on 4500 space errors (3000 missing-space and 1500 insertion errors) in
combination with other types of editing errors are very satisfactory and confirm the validity and
efficiency of the design choices made at the start of our study. Another advantage of our correction
method is its very high correction rate compared with a recognized commercial spellchecker.